UK Road Safety: Traffic Accidents and Vehicles

The goal of this project is to investigate what causes Serious and Fatal accidents, in hopes of preventing them and decreasing their number. The dataset consists of accident records from the UK spanning more than 15 years. I hope to show the causes of these accidents through visualizations and to create an algorithm that can predict the severity of accidents.

The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, type of vehicles, number of casualties and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research.

The data that I'm using is compiled and available through Kaggle and, in a less compiled form, here.

Genesis L. Taylor
Github | Linkedin | Tableau | genesisltaylor@gmail.com

Problem: Traffic Accidents
Solution Method: Use data to determine how to lower both the number of accidents and their severity.

Table of Contents

UK Road Safety: Traffic Accidents and Vehicles Introduction, Data Cleaning, and Feature Manipulation
UK Road Safety: Traffic Accidents and Vehicles Introduction, Data Cleaning, and Feature Manipulation: Github Link

UK Road Safety: Traffic Accidents and Vehicles Visualizations and Solution
UK Road Safety: Traffic Accidents and Vehicles Visualizations and Solution: Github Link

UK Road Safety: Traffic Accidents and Vehicles Machine Learning
UK Road Safety: Traffic Accidents and Vehicles Machine Learning: Github Link

Traffic Analysis and Severity Prediction PowerPoint Presentation
Traffic Analysis and Severity Prediction PowerPoint Presentation: Github Link

Importing and Data Merging

Previously, I did not remove "Data missing or out of range" from the datasets; however, after cleaning and checking the value counts, I decided to do so for sanity purposes. The percentage of rows with this value was not high for most columns, either.
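As a sketch of this step (toy data; the column names below are stand-ins for the merged accident and vehicle files), dropping any row containing the placeholder might look like:

```python
import pandas as pd

# Toy frame standing in for the merged data; real column names may differ.
df = pd.DataFrame({
    "junction_control": ["Give way or uncontrolled",
                         "Data missing or out of range",
                         "Auto traffic signal"],
    "sex_of_driver": ["Male", "Female", "Data missing or out of range"],
})

# Drop every row where any column holds the placeholder value.
mask = df.eq("Data missing or out of range").any(axis=1)
df_clean = df[~mask].reset_index(drop=True)
```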

Data Cleaning

Nulls and Outliers

2nd_road_class

With 40% of the non-null values being unclassified, and 39% of the overall 2nd_road_class column being null, I have decided to drop it in its entirety.

driver_imd_decile

Since the distributions of the categories for driver_imd_decile seem very similar, I've decided not to fill nulls with the mode but with a forward fill (method='ffill').
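A minimal sketch of the forward fill (toy values; the real deciles run 1-10):

```python
import numpy as np
import pandas as pd

# Toy stand-in for driver_imd_decile.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 5.0], name="driver_imd_decile")

# Forward fill: each null takes the last non-null value seen above it.
# (.ffill() is the modern spelling of fillna(method='ffill').)
filled = s.ffill()
```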

age_of_vehicle

Changing the nulls of age_of_vehicle to the median, then converting the column to a category.
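That step might be sketched as (toy values):

```python
import numpy as np
import pandas as pd

# Toy stand-in for age_of_vehicle.
age = pd.Series([1, 4, np.nan, 10, np.nan, 4], name="age_of_vehicle")

age = age.fillna(age.median())   # median of [1, 4, 10, 4] is 4.0
age = age.astype("category")     # treat vehicle age as a category afterwards
```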

Model

Knowing that there are 28,824 unique models in the model column, I have decided to use the ffill method on it as well.

Note: A lot of the values of model are labeled as "missing". I do not want to change these, because the model could actually have been missing from the car involved in the accident, or it may not have been recognizable at the time of the accident.

engine_capacity_cc

I am going to handle both the outliers and the null values of engine_capacity_cc using quantiles and the interquartile range (IQR).

To explain: I am going to use ecmax as the upper bound for engine_capacity_cc and ecmin as the lower bound. Then I'm going to take the mean and use it as my fillna value.

I can accept this distribution, and will now check and handle the nulls.

Going to round this mean value
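One way to read the ecmin/ecmax idea as code (a sketch on toy values; here the fences cap outliers and the rounded mean of the capped values is the fill):

```python
import numpy as np
import pandas as pd

# Toy stand-in for engine_capacity_cc, with one null and one extreme value.
ec = pd.Series([1000, 1200, 1600, 2000, np.nan, 9000], name="engine_capacity_cc")

q1, q3 = ec.quantile(0.25), ec.quantile(0.75)
iqr = q3 - q1
ecmin, ecmax = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # IQR fences

# Cap outliers at the fences, then fill nulls with the rounded mean.
ec = ec.clip(lower=ecmin, upper=ecmax)
fill_value = round(ec.mean())
ec = ec.fillna(fill_value)
```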

Note: After doing the above null fixes, propulsion_code dropped from having 10% null values to 0 (see below). I will continue on and fix lsoa_of_accident_location, then drop the rest of the null values, which are all <5%.

lsoa_of_accident_location

With 35,061 unique values and high counts among the top values, I am deciding to use ffill again.

Dropping the remaining nulls that are <1%.

More outliers, categorizing, and other cleanup

The 'speed_limit' column seems OK, and 'engine_capacity_cc' was already altered above. However, 'number_of_casualties' and 'number_of_vehicles' will be evaluated.

Feature Manipulation, Creation, and Engineering

I want to condense the vehicle type variables.

Create more condensed groups for age band of driver, in order to deal with some potential outliers.
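One way to condense the bands (the mapping below is hypothetical; the real band labels come from the dataset):

```python
import pandas as pd

# Hypothetical fine-grained bands condensed into broader groups.
band_map = {
    "0 - 5": "0 - 15", "6 - 10": "0 - 15", "11 - 15": "0 - 15",
    "16 - 20": "16 - 25", "21 - 25": "16 - 25",
    "66 - 75": "66+", "Over 75": "66+",
}

s = pd.Series(["0 - 5", "21 - 25", "Over 75", "36 - 45"])
condensed = s.map(band_map).fillna(s)   # leave unmapped bands unchanged
```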

General Visualizations

Fridays are the day of the week where the most accidents occur.

Accident Forecasting with Tableau

According to the forecasting above, traffic accidents will be slightly lower than in previous years but will follow similar trends throughout the months.

Below is a screenshot of the above forecasting. I put this here just in case there is trouble viewing it. If you would like to view the actual worksheet for it, please click here.

timeforecastingscreenshot.PNG

Correlations

For correlation I used both Pearson and Spearman, just in case there were discrepancies. The order varied slightly, but the highest-correlated features remained the same.
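Both correlations come from the same pandas call; a minimal sketch on a toy numeric frame (the notebook runs this on the full encoded dataset):

```python
import pandas as pd

# Toy numeric frame with hypothetical columns from the dataset.
df = pd.DataFrame({
    "accident_seriousness": [0, 0, 1, 1, 0, 1],
    "number_of_vehicles":   [2, 1, 3, 4, 2, 3],
    "speed_limit":          [30, 30, 60, 70, 40, 60],
})

pearson  = df.corr(method="pearson")["accident_seriousness"]
spearman = df.corr(method="spearman")["accident_seriousness"]

# Rank features by the absolute value of their correlation with the target.
top = pearson.abs().sort_values(ascending=False)
```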

Looking at this, I wanted to visualize some of the stronger positive/negative correlations against accident severity.

Chi-Squared Test

Before these visualizations were made, I wanted to be sure that the chosen features were of some importance to accident_seriousness. For this, the chi-squared test was used.
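The test pairs each categorical feature with the target via a contingency table; a sketch with scipy on toy data (column names are stand-ins):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical data; the real test runs per feature against the target.
df = pd.DataFrame({
    "accident_seriousness": ["Not Serious", "Serious", "Not Serious",
                             "Serious", "Not Serious", "Serious"],
    "urban_or_rural_area":  ["Urban", "Rural", "Urban",
                             "Rural", "Urban", "Rural"],
})

# Build a contingency table and test for independence.
table = pd.crosstab(df["urban_or_rural_area"], df["accident_seriousness"])
chi2, p_value, dof, expected = chi2_contingency(table)
```

A small p-value suggests the feature and the target are not independent, i.e. the feature is worth visualizing.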

Visualizations In Relation to Accident Seriousness

Method:

For my visualizations I have decided to use some of the features with the highest correlations to accident_seriousness:


Note: The columns used were selected because of the absolute value of their correlation in relation to accident_seriousness

*Columns added after the correlation was re-run following undersampling.

For visual reasons, two separate dataframes were created: one for not-serious and one for serious accidents. I wanted to better scale the data, and for me this was the simplest way of doing so.
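The split is a simple boolean filter (toy frame; the notebook splits the full cleaned dataset the same way):

```python
import pandas as pd

# Toy frame with the target column.
df = pd.DataFrame({
    "accident_seriousness": ["Serious", "Not Serious", "Serious", "Not Serious"],
    "speed_limit": [60, 30, 70, 30],
})

serious     = df[df["accident_seriousness"] == "Serious"]
not_serious = df[df["accident_seriousness"] == "Not Serious"]
```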

Did Police Officer Attend Scene Of Accident

The plots below show whether police officers attended the scene of the accident.

First Point of Impact

The plots below show counts of the first point at which vehicles were hit in an accident.

Number of Vehicles

The below plots show the counts for number of vehicles in each accident.

Speed Limit vs Accident Seriousness

The graphs below show the speed limit in the areas where the accidents occurred.

Urban or Rural Area vs Accident Seriousness

The graphs below show whether the accidents occurred in an Urban or Rural area.

Skidding and Overturning vs Seriousness

The graphs below show whether any skidding, jackknifing, and/or overturning occurred in the accident.

Vehicle Leaving Carriageway vs Seriousness

The graphs below show whether a vehicle left the carriageway and, if so, where it did so.

Sex of Driver vs Seriousness

The below graphs show the sex of the drivers in the accidents.

Vehicle Type vs Seriousness

The graphs below are about the number of accidents by type of vehicle.

Vehicle Manoeuvres

The graphs below depict the types of moves vehicles made that led to the accident.

Driver Home Type Area

This is another look at the type of area, whether Rural, Urban, or Small Town, this time based on where the driver lives.

Age Band of Driver

The graphs below show accidents by age group of the drivers.

Junction Control

The following graphs show what type of traffic signs or signals were present in the accident area, if any.

Hit Object Off Carriageway

The following graphs show whether a vehicle hit an object off the road during the accident and, if so, what object.

Hit Object In Carriageway

The following graphs show whether a vehicle hit an object on the road during the accident and, if so, what object.

Driver IMD Decile

The Driver IMD Decile is the score for the deprivation of an area. The graphs below show accidents by how deprived an area was at the time of the accident.

Junction Detail

The following graphs show the road features in relation to where the accidents occurred.

Junction Location

The graphs below show where on the roads the accidents occurred.

Propulsion Code

The propulsion code is the type of fuel used to power the vehicle. The graphs below show what type of fuel was used in the vehicles in the accidents.

Year

The year of the accidents.

Visualization Summary

Other Visualizations

Due to the previous visualizations a comparison of certain variables was desired to see more correlations.

Junction Control by Junction Detail

The following graph shows what types of traffic control were present in the specific areas of the road where accidents occurred.

Junction Control by Junction Location

The graph below is a more detailed look at junction areas in relation to the accidents.

First point of Impact by Junction Detail

The graph below shows where impact first occurred, by detailed road area type.

First point of Impact by Junction Location

The graph below shows where the accident occurred and what the first point of impact was.

Junction Control and First Point of Impact

The following graph shows what types of traffic control (signage or otherwise) were present at the first point of impact.

Other Visualizations Summary

No matter the comparison above, the most accidents involved areas that were uncontrolled. One of the main culprits was the T or staggered junction (Junction Detail).

Other areas of concern include mid-junction on roundabouts or main roads, and areas approaching a junction where cars were either parking or waiting in the junction.

Solution

From the data above, more controlled areas would be beneficial. Signs alerting drivers of upcoming junctions, traffic lights, or stop signs might help in some of these areas where they are feasible.

staggered-junctions.jpg

For example, this is a staggered junction, the junction detail most involved in accidents. One can understand how a situation such as this can lead to numerous accidents, especially if proper signage is not available. Perhaps traffic lights, stop signs, or warnings indicating that drivers are approaching certain junctions would help reduce accidents.

Web Scraping

Below you will find a web scrape of the Learner Driving Centres website, which contains information on road signs in the UK. The signs were pulled to show examples of signage available to be placed.
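The core of such a scrape is pulling image alt text out of the page HTML. A minimal stdlib sketch on an inline snippet (the real scrape would fetch the live page, e.g. with requests plus BeautifulSoup; the HTML structure below is hypothetical):

```python
from html.parser import HTMLParser

class SignParser(HTMLParser):
    """Collect the alt text of every <img> tag on the page."""
    def __init__(self):
        super().__init__()
        self.signs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.signs.append(dict(attrs).get("alt", ""))

# Tiny snippet standing in for the fetched page.
html = """
<div class="sign"><img src="stop.png" alt="Stop sign"></div>
<div class="sign"><img src="give-way.png" alt="Give way sign"></div>
"""

parser = SignParser()
parser.feed(html)
```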

Mapping of Problem Areas

Below we used Tableau to map what could be deemed problem areas for the UK. These are accidents in areas with high deprivation (driver_imd_decile at "more deprived 40-50%") and no signage at T or staggered junctions.

Below is a screenshot of the above mapping. I put this here just in case there is trouble viewing it. If you would like to view the actual worksheet for it, please click here.

mapping.PNG

Machine Learning

Preprocessing

LabelEncoder was used instead of OneHotEncoder due to the memory errors OneHotEncoder caused on this data. The algorithms used will be tree- and boosting-based classifiers, not linear models.
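The encoding might be sketched like this (toy frame with hypothetical columns; each string column gets its own encoder):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; real column names come from the cleaned dataset.
df = pd.DataFrame({
    "sex_of_driver": ["Male", "Female", "Male"],
    "urban_or_rural_area": ["Urban", "Rural", "Urban"],
    "speed_limit": [30, 60, 30],
})

# Encode each object (string) column in place; numeric columns are untouched.
for col in df.select_dtypes(include="object"):
    df[col] = LabelEncoder().fit_transform(df[col])
```

Note this is fine for tree-based models, which only split on values; for linear models the implied ordering would be misleading.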

Imbalanced Data

The data in this dataset is extremely imbalanced with respect to what we are trying to predict. We are going to resample the data via undersampling, reducing the number of majority-class (Not Serious) samples.


The machine learning classifier algorithms that we are going to use are as follows:


*Gradient Boosting was commented out because of the time it took to run (18 hrs) without producing results relevant enough to justify keeping it.

Resample: Undersampling
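The undersampling step might be sketched as follows (a minimal pandas version on toy data; the notebook may equally use imblearn's RandomUnderSampler):

```python
import pandas as pd

# Toy imbalance: ten "Not Serious" (0) rows vs three "Serious" (1) rows.
df = pd.DataFrame({
    "accident_seriousness": [0] * 10 + [1] * 3,
    "row_id": range(13),
})

minority = df[df["accident_seriousness"] == 1]
majority = df[df["accident_seriousness"] == 0].sample(
    n=len(minority), random_state=42)       # downsample the majority class
balanced = pd.concat([majority, minority])
```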

Unsupervised Learning

Before we get into predictions, we are going to run some unsupervised learning in order to see how the data relates to itself. We are going to do this on the resampled data as well, in order to avoid bias. We will use two clusters which, in theory, represent the two categories of accident_seriousness: Not Serious and Serious.
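A two-cluster run might be sketched as follows (synthetic blobs stand in for the resampled feature matrix; the real features come from the encoded dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs standing in for the resampled feature matrix.
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

# Two clusters, one per hypothesized accident_seriousness category.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```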

Looking at these graphs, we can see how each category of each column pairs off with the clustering on accident_seriousness.

Supervised Learning with Resampling as Undersampling

Method 1

First, we are going to run some standard classifier algorithms using the resampling method from above, gather the results of several scoring metrics (Accuracy, Log Loss, Cross Validation, Recall, ROC AUC, F1, False Positive Rate, Error Rate), and put those scores into a dataframe.
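The loop might be sketched as follows (synthetic data, two classifiers, and a subset of the metrics; the notebook uses more of each):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the resampled (balanced) data.
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rows = []
for name, clf in [("Random Forest", RandomForestClassifier(random_state=0)),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    rows.append({"Algorithm": name,
                 "Accuracy": acc,
                 "Recall": recall_score(y_test, y_pred),
                 "Roc Auc": roc_auc_score(y_test, y_pred),
                 "F1": f1_score(y_test, y_pred),
                 "Error Rate": 1 - acc})

scores = pd.DataFrame(rows)   # one row of metrics per classifier
```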

Method 2

For the following balanced algorithms from imblearn, we will use the standard training and testing sets (X_train, X_test, y_train, y_test) and will let the algorithms do the resampling.

For the sampling_strategy, we will be using majority as the solution.

'majority': resample only the majority class

We will then gather the results of some scoring metrics (Accuracy, Log Loss, Cross Validation, Recall, Roc Auc, F1, False Positive Rate, Error Rate), and put those scores into a dataframe.

We will now combine the dataframes from both methods into one dataframe for analysis and visualization.

Choice

Based on the visualizations above, Balanced Bagging Classifier from imblearn is the algorithm of choice for this data. While some of the scores may have been close, Balanced Bagging Classifier had higher scores in Accuracy, Cross Validation, and Specificity. The algorithm also had the lowest Error Rate and False Positive Rate of the group.

Balanced Bagging Classifier with LightGBM

Balanced Bagging Classifier performed the best of the classifiers; however, I was not comfortable with how close its predictions were for Serious accidents in the confusion matrix. Because of this, I decided to combine Balanced Bagging Classifier with the second-highest-performing algorithm, LightGBM, to see what results I would get.

The results were better than those of the other learning algorithms, but lower, accuracy-wise, than the previous Balanced Bagging run. Taking all of that into consideration, I have decided that either Balanced Bagging variant could be used, depending on the goal. If I were more concerned with overall accuracy, the regular Balanced Bagging Classifier would be used. If I were more concerned with making sure "Serious" predictions were captured, Balanced Bagging Classifier with LightGBM would be used.
